Explainer Notebook: A prosperous but noisy city: Noise analysis in New York

1 Motivation

What is your dataset?

The dataset we used for this project is the record of 311 Service Requests, a government hotline, taken from the NYC Open Data website; it reflects the daily-life problems of many residents. Link
The dataset includes the hotline's records from 2010 to the present, about a decade, covering all aspects of residents' daily life.


New Yorkers can complain via NYC's online customer service, text message, phone call, Skype, etc. The NYC 311 dataset covers all aspects of citizens' lives in New York and can be roughly divided into the following categories: Benefits & Support, Business & Consumers, Courts & Law, Culture & Recreation, Education, Employment, Environment, Garbage & Recycling, Government & Elections, Health, Housing & Buildings, Noise, Pets, Pests & Wildlife, Public Safety, Records, Sidewalks, Streets & Highways, Taxes, and Transportation.

NYC311's mission is to provide the public with fast, convenient access to city government services and information, while delivering the best possible customer service. It also helps agencies improve the services they offer, allowing them to focus on their core tasks and manage their workloads effectively. Meanwhile, NYC 311 provides insights for improving city government through accurate and consistent measurement and analysis of service delivery.


Moreover, NYC311 is available 24 hours a day, 7 days a week, 365 days a year. Not only does NYC311 offer an online translation service in more than 50 languages, but users can also call the 311 hotline in more than 175 languages if their language is not included. In addition, people who are deaf, hard of hearing, or have a speech impairment can file complaints with special assistance such as a video relay service (VRS).


We believe there is a lot of information to explore in such a large and data-rich dataset.

The material we used for this explainer notebook can be found here. You can also find all the source code and the dataset in the folder. Due to the limited memory of our laptops, the dataset has been transformed and split into several files.
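The transform-and-split step can be reproduced with pandas' chunked reader, which never loads the full 12 GB dump into memory at once. A minimal sketch on placeholder data (the filename, columns, and chunk size below are illustrative, not the exact ones we used):

```python
import pandas as pd

# Write a tiny CSV to stand in for the full 311 dump (placeholder data).
pd.DataFrame({
    'Unique Key': [1, 2, 3, 4],
    'Complaint Type': ['Noise', 'Rodent', 'Noise', 'Noise'],
}).to_csv('311-sample.csv', index=False)

# Stream the file in fixed-size chunks, keeping only the columns and rows
# needed for the analysis, so memory use stays bounded.
keep = ['Unique Key', 'Complaint Type']
parts = []
for chunk in pd.read_csv('311-sample.csv', usecols=keep, chunksize=2):
    parts.append(chunk[chunk['Complaint Type'] == 'Noise'])

noise = pd.concat(parts, ignore_index=True)
print(len(noise))  # 3 noise rows survive the filter
```

Each surviving part could then be written out with `to_csv`, which is how a dump larger than RAM ends up as several smaller files.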

Why did you choose this particular dataset?

It was impossible for us to conduct a comprehensive analysis of this enormous dataset, so after some preliminary statistics we chose the category with the most cumulative complaints over the past decade: Noise.


First of all, when it comes to environmental pollution, people may first think of air, soil, and water, but noise pollution, invisible and intangible as it is, has an impact on us that equally cannot be ignored. As a serious "urban disease", noise pollution has increasingly become a focus of modern urban life. New York, a prosperous international city, has this problem as well.


We want to study the noise complaints in New York and analyze them from both a spatial and a temporal perspective. Through the noise complaints, we hope to learn something about the urban conditions, economic development, residents' living conditions, and traffic conditions in the five boroughs of New York. We also wonder whether noise complaints can be used to describe the overall development and general condition of New York City over a 10-year period.

What was your goal for the end user's experience?

To begin with, we want to share interesting insights from the noise analysis with our readers. The seemingly boring government complaint hotline actually contains many interesting insights, which not only reflect people's lives in New York but also provide some directions and suggestions for the government to improve city services. Also, through the analysis of noise complaints in NYC, we hope users can come to understand the character, living habits, preferences, and cultural backgrounds of the residents of the five boroughs of New York.

Furthermore, we hope that readers can freely access the information they find useful through the interactive map and interactive bar chart while reading the New York stories we present, which not only increases readers' understanding but also makes reading more participatory and interesting.

2 Basic stats

Overview of the dataset

import pandas as pd
import numpy as np
# low_memory=False avoids the DtypeWarning caused by mixed-type columns in the raw export
df_origin=pd.read_csv('311-2019-all.csv',low_memory=False)
df=pd.read_csv('311-All-Concise-with-IncidentZip.csv',low_memory=False)

The dataset has 22.8M rows and 41 columns, with a size of 12 GB (the loaded frame also carries the derived columns year, month, day, and hour, giving 45 in total). A sample is shown below.

df_origin.head(10)
Unique Key Created Date year month day hour Closed Date Agency Agency Name Complaint Type ... Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Latitude Longitude Location
0 41321073 01/02/2019 04:18:15 AM 2019 1 2 4 01/28/2019 12:00:00 AM DOB Department of Buildings Building/Use ... NaN NaN NaN NaN NaN NaN NaN 40.667913 -73.762907 (40.66791269444524, -73.76290665549617)
1 41337870 01/04/2019 12:32:43 PM 2019 1 4 12 02/08/2019 12:00:00 AM DOB Department of Buildings Building/Use ... NaN NaN NaN NaN NaN NaN NaN 40.614300 -73.983799 (40.614300433949396, -73.98379900459882)
2 41341954 01/03/2019 02:01:00 PM 2019 1 3 14 01/04/2019 12:00:00 PM DSNY Bronx 04 Sanitation Condition ... NaN NaN NaN NaN NaN NaN NaN 40.844065 -73.920989 (40.84406530809191, -73.9209886008006)
3 41306471 01/01/2019 01:17:45 AM 2019 1 1 1 01/01/2019 02:23:50 AM NYPD New York City Police Department Noise - Street/Sidewalk ... NaN NaN NaN NaN NaN NaN NaN 40.685876 -73.832368 (40.68587594472689, -73.83236785129283)
4 41306914 01/01/2019 12:29:00 AM 2019 1 1 0 01/01/2019 12:30:00 AM DOT Department of Transportation Traffic Signal Condition ... NaN NaN NaN NaN NaN NaN NaN 40.669725 -73.860962 (40.66972534989857, -73.86096222619501)
5 41307247 01/01/2019 12:06:00 AM 2019 1 1 0 01/01/2019 01:20:00 AM DOT Department of Transportation Traffic Signal Condition ... NaN NaN NaN NaN NaN NaN NaN 40.819694 -73.901602 (40.819693804076756, -73.90160159492376)
6 41307542 01/01/2019 12:37:49 AM 2019 1 1 0 01/01/2019 04:36:44 AM NYPD New York City Police Department Noise - Residential ... NaN NaN NaN NaN NaN NaN NaN 40.883121 -73.892177 (40.88312100776275, -73.89217743817532)
7 41307543 01/01/2019 01:58:57 AM 2019 1 1 1 01/01/2019 03:14:29 PM NYPD New York City Police Department Noise - Residential ... NaN NaN NaN NaN NaN NaN NaN 40.827401 -73.915939 (40.82740133611878, -73.91593884606851)
8 41307738 01/01/2019 01:42:20 AM 2019 1 1 1 01/01/2019 04:29:18 AM DHS Operations Unit - Department of Homeless Services Homeless Person Assistance ... NaN NaN NaN NaN NaN NaN NaN 40.839962 -73.876797 (40.83996165662155, -73.87679735720897)
9 41309122 01/01/2019 12:00:00 AM 2019 1 1 0 01/10/2019 12:00:00 AM DOHMH Department of Health and Mental Hygiene Rodent ... NaN NaN NaN NaN NaN NaN NaN 40.738874 -73.872492 (40.7388739110531, -73.87249162291612)

10 rows × 45 columns

The attributes are shown as follows:

df_origin.columns
Index(['Unique Key', 'Created Date', 'year', 'month', 'day', 'hour',
       'Closed Date', 'Agency', 'Agency Name', 'Complaint Type', 'Descriptor',
       'Location Type', 'Incident Zip', 'Incident Address', 'Street Name',
       'Cross Street 1', 'Cross Street 2', 'Intersection Street 1',
       'Intersection Street 2', 'Address Type', 'City', 'Landmark',
       'Facility Type', 'Status', 'Due Date', 'Resolution Description',
       'Resolution Action Updated Date', 'Community Board', 'BBL', 'Borough',
       'X Coordinate (State Plane)', 'Y Coordinate (State Plane)',
       'Open Data Channel Type', 'Park Facility Name', 'Park Borough',
       'Vehicle Type', 'Taxi Company Borough', 'Taxi Pick Up Location',
       'Bridge Highway Name', 'Bridge Highway Direction', 'Road Ramp',
       'Bridge Highway Segment', 'Latitude', 'Longitude', 'Location'],
      dtype='object')

We made a bar chart of the 15 most frequent complaint types in New York during 2010~2020 to get some inspiration.

import matplotlib.pyplot as plt
complaint_count=df['Complaint Type'].value_counts()
complaint_count.iloc[0:15]  # peek at the top complaint types

title='The 15 most frequent complaint type in New York during 2010~2020'
to_display=complaint_count[0:15]
f,p=plt.subplots(figsize=(10,8)) 
p.bar(to_display.index,to_display.values)
p.tick_params(axis='x',labelrotation=90)
p.tick_params(labelsize=10)
p.set_title(title,fontsize=12)
p.set_xlabel('Complaint Type',fontsize=10)
p.set_ylabel('Number of cases',fontsize=10)
plt.show()

From the figure, we found that noise is the most reported complaint type, which inspired us to explore it further. For the temporal and spatial analysis of noise, we deemed only 9 attributes relevant and retained them.

df.columns
Index(['Created Date', 'Closed Date', 'Complaint Type', 'Descriptor',
       'Location Type', 'Incident Zip', 'Borough', 'Latitude', 'Longitude'],
      dtype='object')

These attributes are used for different purposes.

  • Created Date/Closed Date: Used to label the time of each case, serving the temporal analysis. They are stored as strings, which have to be transformed to the Pandas Datetime format.
  • Complaint Type: Main complaint type. It has 439 different values and provides a foundational classification of each complaint.
  • Descriptor: For some main types, the names alone may be ambiguous. The Descriptor is associated with the Complaint Type and provides a further explanation; it can be seen as a set of sub-types for each Complaint Type. It has 1168 different values.
  • Location Type: Describes the type of location used in the address information. It corresponds to 'Complaint Type' as well as 'Descriptor' and thereby adds explanation. For example, the location type Store corresponds to the complaint type Noise - Commercial. It helps when the Complaint Type and Descriptor are ambiguous.
  • Incident Zip: Zip code of the block where the incident took place.
  • Borough: Name of the borough where the incident took place.
  • Latitude/Longitude: Coordinates of the incident position.

The attributes we used directly for the report are:

  • Created Date
  • Complaint Type
  • Descriptor
  • Incident Zip
  • Borough

Data preprocessing and cleaning

Datetime

Firstly, we adopt Created Date as the time when the incident happened. It has to be transformed to pandas datetime objects so that we can extract information such as the year or month.

suitform='%m/%d/%Y %I:%M:%S %p'  # %I (12-hour clock) is required together with %p
df['TransCDatetime']=pd.to_datetime(df['Created Date'],format=suitform)
df['month']=[i.month+(i.year-2010)*12 for i in df['TransCDatetime']]  # months elapsed since Jan 2010
time_nan=df['TransCDatetime'].isna()
time_nan.sum()
print('The percentage of NaN values for the created time is {:10.2f}%'.format(time_nan.sum()/df.shape[0]*100))
The percentage of NaN values for the created time is       0.00%

We successfully transformed the datetime format, which indicates that all the elements are valid; no NaN values were detected in this attribute.
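The list comprehension above works, but the same month index can also be computed with pandas' vectorized `.dt` accessor, which avoids a Python-level loop over 22.8M rows. A minimal sketch on placeholder timestamps in the 311 export format:

```python
import pandas as pd

# Placeholder timestamps (12-hour clock with AM/PM, hence %I rather than %H).
s = pd.Series(['01/02/2019 04:18:15 AM', '12/31/2010 11:59:59 PM'])
t = pd.to_datetime(s, format='%m/%d/%Y %I:%M:%S %p')

# Months elapsed since January 2010, computed column-wise.
month_index = t.dt.month + (t.dt.year - 2010) * 12
print(month_index.tolist())  # [109, 12]
```

On a frame of this size the vectorized form is typically much faster than iterating over the parsed column.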

Complaint type and Descriptor

For the noise analysis, we focused on the 5 most frequently reported noise types. All 5 of them are among the top 50 reported complaint types overall.

complaint_count=df['Complaint Type'].value_counts()
TOP_COMPLAINTS=50
cared=complaint_count.iloc[0:TOP_COMPLAINTS].index
Noise_type=[]
for i in cared:
    if 'oise' in i:
        Noise_type.append(i)
Noise_type
['Noise - Residential',
 'Noise',
 'Noise - Street/Sidewalk',
 'Noise - Commercial',
 'Noise - Vehicle']

In each main type, we also have subtypes which are shown below.

Noise_summary=dict()
for i in Noise_type:
    temp=df[df['Complaint Type']==i]
    Noise_summary[i]=temp

for i in Noise_type:
    print('The main type is', i)
    subtype=Noise_summary[i]['Descriptor'].unique()
    for j in subtype:
        print('    The subtype is',j)
The main type is Noise - Residential
    The subtype is Loud Music/Party
    The subtype is Banging/Pounding
    The subtype is Loud Talking
    The subtype is Loud Television
The main type is Noise
    The subtype is Noise, Barking Dog (NR5)
    The subtype is Noise: Construction Equipment (NC1)
    The subtype is Noise: Alarms (NR3)
    The subtype is Noise: Construction Before/After Hours (NM1)
    The subtype is Noise: Jack Hammering (NC2)
    The subtype is Noise: air condition/ventilation equipment (NV1)
    The subtype is Noise, Ice Cream Truck (NR4)
    The subtype is Noise: Private Carting Noise (NQ1)
    The subtype is Noise: Manufacturing Noise (NK1)
    The subtype is Noise, Other Animals (NR6)
    The subtype is Noise:  lawn care equipment (NCL)
    The subtype is Noise: Boat(Engine,Music,Etc) (NR10)
    The subtype is Noise: Loud Music/Nighttime(Mark Date And Time) (NP1)
    The subtype is Noise: Loud Music/Daytime (Mark Date And Time) (NN1)
    The subtype is Noise: Other Noise Sources (Use Comments) (NZZ)
    The subtype is Noise: Air Condition/Ventilation Equip, Commercial (NJ2)
    The subtype is Noise: Vehicle (NR2)
    The subtype is Horn Honking Sign Requested (NR9)
    The subtype is Noise: Air Condition/Ventilation Equip, Residential (NJ1)
    The subtype is Noise: Loud Music From Siebel System - For Dep Internal Use Only (NP21)
    The subtype is Construction Before/After Hours - For DEP Internal Use Only
The main type is Noise - Street/Sidewalk
    The subtype is Loud Talking
    The subtype is Loud Music/Party
The main type is Noise - Commercial
    The subtype is Loud Music/Party
    The subtype is Car/Truck Music
    The subtype is Loud Talking
    The subtype is Banging/Pounding
    The subtype is Car/Truck Horn
    The subtype is Loud Television
The main type is Noise - Vehicle
    The subtype is Car/Truck Music
    The subtype is Engine Idling
    The subtype is Car/Truck Horn

In summary, we have 5 main types and 36 subtypes, all of which we consider valid. They account for 97.8% of all noise cases, so these 3.5M rows are sufficient to obtain a sound overall picture.

main_noise=df[df['Complaint Type'].str.contains('oise', regex=False)]
counts=main_noise['Complaint Type'].value_counts()
counts=counts.iloc[0:5]  # the 5 main noise types
count=0
for i in df['Complaint Type']:
    if 'oise' in i:
        count+=1
print('The number of considered noise cases is {}'.format(counts.sum()))
The number of considered noise cases is 3441929
print('The percentage of considered noise out of the whole noise cases is {:10.1f}%'.format(counts.sum()/count*100))
The percentage of considered noise out of the whole noise cases is       97.8%

Incident Zip and Coordinates

We created a choropleth map of the distribution of noise cases across blocks in 2019 by counting the number of cases for each zipcode.
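The per-zipcode counts that feed the choropleth can be produced with a simple `value_counts`. A minimal sketch on placeholder rows (the real map additionally needs a GeoJSON of NYC zipcode boundaries, which is omitted here):

```python
import pandas as pd

# Placeholder noise cases; the real frame has one row per 311 request.
cases = pd.DataFrame({'Incident Zip': [11213, 11213, 10031, 11213, 10031, 10467]})

# Number of noise cases per zipcode -- the value column of the choropleth.
zip_counts = (cases['Incident Zip']
              .value_counts()
              .rename_axis('Incident Zip')
              .reset_index(name='cases'))
print(zip_counts)
```

The resulting two-column frame (zipcode, case count) is exactly the shape a choropleth layer expects as its data argument.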

In the first place, the data quality for the ten years (2010~2020) is analyzed.

df['Incident Zip'].unique()
array([11213.0, 11221.0, 10031.0, ..., 11472.0, 10704.0, 10507.0],
      dtype=object)

Two main problems for the attribute Zipcode have been detected:

  • NaN values
  • Zipcode with invalid characters, e.g. alphabet

It is necessary to figure out the percentage of valid values, which is calculated as follows.

# verify whether each item has the following problems: NaN, invalid characters
import re
zipnan=df['Incident Zip'].isna()
zipnan=zipnan.to_numpy()
zipalph=[]
for i in df['Incident Zip']:
    a=(re.search('[a-zA-Z]', str(i))!=None)  # contains a letter
    b=(re.search('[-]', str(i))!=None)       # contains a hyphen
    zipalph.append(a and b)
zipalph=np.array(zipalph)
percentage=zipalph.sum()+zipnan.sum()
print('The percentage of invalid value of the whole dataset is {:10.2f}%'.format(percentage/df.shape[0]*100))
The percentage of invalid value of the whole dataset is       5.79%

The percentage of invalid values is 5.79%, which is acceptable because we mainly focus on the overall distribution and trend of some focused features.

However, the interactive map presents the noise distribution in 2019, so particular attention should be paid to the data of this year.

df['year']=[i.year for i in df['TransCDatetime']]
df_2019=df[df['year']==2019]
import re
zipnan1=df_2019['Incident Zip'].isna()
zipnan1=zipnan1.to_numpy()
zipalph1=[]
for i in df_2019['Incident Zip']:
    a=(re.search('[a-zA-Z]', str(i))!=None)
    b=(re.search('[-]', str(i))!=None)
    zipalph1.append(a and b)
zipalph1=np.array(zipalph1)  # note: the 2019 flags, not the full-dataset ones
percentage=zipalph1.sum()+zipnan1.sum()
print('The percentage of invalid value for 2019 is {:10.2f}%'.format(percentage/df_2019.shape[0]*100))
The percentage of invalid value for 2019 is       3.16%

We can see that the 2019 data is of better quality than the full dataset (3.16% for 2019 vs. 5.79% for 2010~2020), which indicates an improvement in data collection by the government.

But we still wanted to correct the invalid values in 2019. K-nearest-neighbours (KNN) is a machine learning algorithm that can be adopted for this problem, because a zipcode is determined by the coordinates of the point. Therefore, the first thing that came to our mind was to calculate the probability of an invalid coordinate given an invalid zipcode.
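If valid coordinates were available for the rows with invalid zipcodes, the imputation idea could look like this minimal hand-rolled 1-nearest-neighbour sketch (pure numpy; the points and zipcodes below are synthetic placeholders, and a real run would train on the valid 2019 rows):

```python
import numpy as np

# Synthetic training data: coordinates with known zipcodes (placeholders).
coords = np.array([[40.67, -73.93],   # Brooklyn-ish
                   [40.82, -73.95],   # Manhattan-ish
                   [40.85, -73.92]])  # Bronx-ish
zips = np.array([11213, 10031, 10453])

def impute_zip(lat, lon):
    """Assign the zipcode of the geometrically closest known case."""
    d2 = (coords[:, 0] - lat) ** 2 + (coords[:, 1] - lon) ** 2
    return zips[np.argmin(d2)]

print(impute_zip(40.68, -73.94))  # nearest known case is the first point -> 11213
```

As the next steps show, this approach only works if the rows missing a zipcode actually have coordinates, which is what we check below.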

First, we should examine the coordinates. After an initial inspection, we found that NaN values are present among them.

latnan1=df_2019['Latitude'].isna()
latnan1=latnan1.to_numpy()
print('The percentage of invalid value of coordinates for 2019 is {:10.2f}%'.format(latnan1.sum()/df_2019.shape[0]*100))
The percentage of invalid value of coordinates for 2019 is       5.31%
a=df_2019['Latitude'].isna() & df_2019['Longitude'].isna()
b=df_2019['Latitude'].isna()
print('Total number of rows where both Latitude and Longitude are NaN is {}'.format(a.sum()))
print('Total number of rows where Latitude is NaN is {}'.format(b.sum()))
Total number of rows where both Latitude and Longitude are NaN is 130434
Total number of rows where Latitude is NaN is 130434

The two numbers are equal, which means that whenever NaN is present in Latitude, it is also present in the corresponding Longitude.

After removing the NaN values, we used boxplots to check whether there are outliers in the coordinates in terms of value range.

f,p=plt.subplots(1,2,sharex=True,figsize=(20,5))
font=18
#titledict={'x':0.02,'y':0.9}
p[0].set_title('Latitude of noise cases',fontsize=font)
p[0].boxplot(df_2019[~b]['Latitude'])
p[0].tick_params(labelsize=font)
p[1].set_title('Longitude of noise cases',fontsize=font)
p[1].boxplot(df_2019[~b]['Longitude'])
p[1].tick_params(labelsize=font)
plt.show()

After removing the NaN values, all the coordinates lie within the territorial scope of New York City, so we consider that no other outliers are included.
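The same range check can be made explicit with an approximate bounding box for New York City. The bounds below are rough values we supply for illustration (roughly latitude 40.49 to 40.92, longitude -74.26 to -73.70), suitable only as a sanity check:

```python
import pandas as pd

# Approximate NYC bounding box (rough placeholder values).
LAT_MIN, LAT_MAX = 40.49, 40.92
LON_MIN, LON_MAX = -74.26, -73.70

sample = pd.DataFrame({
    'Latitude':  [40.668, 40.875, 51.5],     # last row is clearly outside NYC
    'Longitude': [-73.926, -73.875, -0.12],
})

# Boolean mask of rows whose coordinates fall inside the box.
inside = (sample['Latitude'].between(LAT_MIN, LAT_MAX)
          & sample['Longitude'].between(LON_MIN, LON_MAX))
print(inside.sum())  # 2 of the 3 rows fall inside the box
```

Applying such a mask to the real frame would flag any stray coordinates that a boxplot might miss.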

We then calculate the probability of an invalid coordinate given an invalid zipcode.

notused=0
for i in range(df_2019['Incident Zip'].shape[0]):
    if latnan1[i] and zipnan1[i] and not zipalph1[i]:
        notused+=1
print('The percentage of invalid coordinates given an invalid zipcode is {:10.2f}%'.format(notused/percentage*100))
The percentage of invalid coordinates given an invalid zipcode is      99.83%

This means that for an invalid zip code, it is 99.83% likely that the coordinates are missing as well. Therefore KNN would not be effective, and we can also infer that when the government did not record the zipcode, they did not record the position of the case either.

Based on the above analysis, we discarded the invalid zipcode values; this will not influence the conclusions of the analysis.

Borough

We created an interactive bar chart displaying the distributions of various noise types across the different boroughs.

In the first place, the data quality for the ten years (2010~2020) is analyzed.

df['Borough'].unique()
array(['BROOKLYN', 'MANHATTAN', 'BRONX', 'STATEN ISLAND', 'QUEENS',
       'Unspecified'], dtype=object)

It shows that the only invalid value is 'Unspecified', for which we calculated the percentage in the whole dataset.

unspecified_whole=(df['Borough']=='Unspecified')
print('The percentage of invalid value of the whole dataset is {:10.2f}%'.format(unspecified_whole.sum()/df.shape[0]*100))
The percentage of invalid value of the whole dataset is       5.35%

The percentage of invalid values is 5.35%, which is acceptable to discard because we mainly focus on the overall distribution and trend of selected features. However, the interactive bar chart presents the distributions of the various noise types across boroughs in 2019, so particular attention should be paid to the data quality of this year.

unspecified_2019=(df_2019['Borough']=='Unspecified')
print('The percentage of invalid value for 2019 is {:10.2f}%'.format(unspecified_2019.sum()/df_2019.shape[0]*100))
The percentage of invalid value for 2019 is       0.91%

Again, the 2019 data is of better quality than the full dataset (0.91% for 2019 vs. 5.35% for 2010~2020), which indicates an improvement in data collection by the government.

Because 0.91% is quite a small percentage, we discarded the unspecified values; this will not have an influence on the conclusions of the analysis.

Summary of the dataset after cleaning and preprocessing

Because the dataset covers a great number of complaint types, it was necessary to narrow it down to the main ones to obtain the main trends and features of noise in New York City. After cleaning and preprocessing, the dataset only contains the attributes necessary for the report: 22,662,415 rows and 12 columns (9 original attributes plus the derived TransCDatetime, month, and year).

df.head(10)
Created Date Closed Date Complaint Type Descriptor Location Type Incident Zip Borough Latitude Longitude TransCDatetime month year
0 03/27/2014 12:00:00 AM 03/29/2014 12:00:00 AM HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 11213 BROOKLYN 40.668545 -73.926207 2014-03-27 12:00:00 51 2014
1 03/27/2014 12:00:00 AM 03/29/2014 12:00:00 AM HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 11221 BROOKLYN 40.689455 -73.935520 2014-03-27 12:00:00 51 2014
2 03/27/2014 12:00:00 AM 03/31/2014 12:00:00 AM HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 10031 MANHATTAN 40.819232 -73.952935 2014-03-27 12:00:00 51 2014
3 03/27/2014 12:00:00 AM 03/28/2014 12:00:00 AM HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 10467 BRONX 40.875128 -73.875174 2014-03-27 12:00:00 51 2014
4 03/27/2014 03:34:00 PM 03/21/2014 01:36:00 AM Street Light Condition Street Light Out NaN 10312 STATEN ISLAND 40.547109 -74.166197 2014-03-27 03:34:00 51 2014
5 03/27/2014 03:26:00 PM 03/14/2014 10:30:00 PM Street Light Condition Street Light Out NaN 10301 STATEN ISLAND 40.640741 -74.081307 2014-03-27 03:26:00 51 2014
6 03/27/2014 12:00:00 AM 03/28/2014 12:00:00 AM HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 10456 BRONX 40.831120 -73.919226 2014-03-27 12:00:00 51 2014
7 03/27/2014 12:00:00 AM 03/27/2014 12:00:00 AM HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 10035 MANHATTAN 40.799635 -73.940867 2014-03-27 12:00:00 51 2014
8 03/27/2014 12:00:00 AM 04/02/2014 12:00:00 AM HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 11206 BROOKLYN 40.702106 -73.936153 2014-03-27 12:00:00 51 2014
9 03/27/2014 12:00:00 AM 03/29/2014 12:00:00 AM HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 10453 BRONX 40.857149 -73.902504 2014-03-27 12:00:00 51 2014

3 Data analysis

Summing up main types and subtypes.

counts=main_noise['Complaint Type'].value_counts()
counts=counts.iloc[0:5,]
plt.figure(figsize=(8,6))
counts.plot(kind='bar')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('The sum of each main type (The 5 most frequently)',fontsize=15)
plt.xlabel('Main noise type',fontsize=12)
plt.ylabel('Number of cases',fontsize=12)
plt.show()

The most frequent main type is Noise - Residential, which shows that noise cases are mostly reported by residents. Below, we also sum up all the subtypes of noise.

sub_noise=main_noise['Descriptor'].value_counts()
plt.figure(figsize=(12,8))
sub_noise.plot(kind='bar')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('The sum of each subtype',fontsize=15)
plt.xlabel('Descriptor(Subtype of noise)',fontsize=12)
plt.ylabel('Number of cases',fontsize=12)
plt.show()

Plotting the monthly trend of main types

f,p=plt.subplots(len(Noise_type),figsize=(12,10),constrained_layout=True)
#f.tight_layout()
m=0
month_range=np.arange(df['month'].min(),df['month'].max()+1)
month_range_scarce=np.arange(df['month'].min(),df['month'].max()+1,5)
for i in Noise_type:
    monthly=pd.Series(np.zeros(len(month_range)+1),dtype='int32')
    drawn=df[df['Complaint Type']==i]['month'].value_counts()
#    print('I am doing ', i)
    for j in drawn.index:
        monthly.loc[j]=drawn[j]
    p[m].bar(month_range,monthly[month_range])
    p[m].set_title(i,size=10)
    p[m].tick_params(axis='x',labelrotation=90)
    p[m].set_ylim(0,1.2*monthly.max(axis=0))
#    p[m].tick_params(labelsize=30)
    p[m].set_xticks(month_range_scarce)
    p[m].set_xlabel('Month')
    p[m].set_ylabel('Number of cases')   
    m+=1

We observed that all five main noise types show an increasing trend from 2010 to 2020 as well as seasonal fluctuation.

We can obtain more information if the monthly trend of each subtype is plotted.

Plotting the monthly trend of sub types

m=0
n=0   
f,p=plt.subplots(18,2,figsize=(70,150))
for i in Noise_type:  
    subtype=Noise_summary[i]['Descriptor'].unique()
    
    plt.subplots_adjust(hspace = 0.4)
    for j in subtype:
        monthly=pd.Series(np.zeros(len(month_range)+1),dtype='int32')
        drawn=Noise_summary[i][Noise_summary[i]['Descriptor']==j]['month'].value_counts()
        for k in drawn.index:
            monthly.loc[k]=drawn[k]
        p[m][n].bar(month_range,monthly[month_range])
        p[m][n].set_title(i+':  '+j,size=30)
        p[m][n].tick_params(axis='x',labelrotation=90)
        p[m][n].set_ylim(0,1.2*monthly.max(axis=0))
        p[m][n].tick_params(labelsize=30)
        p[m][n].set_xticks(month_range_scarce)
        p[m][n].set_xlabel('Month',fontsize=30)
        p[m][n].set_ylabel('Number of cases',fontsize=30)
        n+=1
        if n==2:
            m+=1
            n=0

After an initial analysis, we focus only on the noise subtypes with complete data (available for all of 2010 to 2020). Generally they show a seasonal pattern: more cases in the summer, fewer in the winter. Besides that, we sorted them into three categories in terms of overall trend.

  • Ascending trend: most of the subtypes, mostly related to human activity, e.g. Loud Music/Party, Loud Talking.
  • Stable: only a few, mostly unrelated to human activity, e.g. Barking Dog.
  • Descending trend: only one, Jack Hammering.
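The trend labels above can also be assigned programmatically from the sign of a fitted slope. A minimal sketch with `scipy.stats.linregress` on synthetic monthly counts (the `flat_tol` threshold is a placeholder choice, not a value we calibrated):

```python
import numpy as np
from scipy import stats

def trend_label(monthly_counts, flat_tol=0.5):
    """Classify a monthly series as ascending, stable, or descending
    from the sign of the fitted slope (flat_tol is a placeholder threshold)."""
    months = np.arange(len(monthly_counts))
    slope = stats.linregress(months, monthly_counts).slope
    if slope > flat_tol:
        return 'ascending'
    if slope < -flat_tol:
        return 'descending'
    return 'stable'

# Synthetic series: steadily growing counts vs. flat counts.
print(trend_label(np.arange(120) * 2.0 + 100))   # ascending
print(trend_label(np.full(120, 100.0)))          # stable
```

In practice the threshold would need tuning per subtype, since the case counts span very different magnitudes.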

Analysis of coordinates distribution

from scipy.stats import gaussian_kde
main_noise=main_noise[~np.isnan(main_noise['Latitude'])]
font=18
# histogram
f,p=plt.subplots(2,1,figsize=(8,8))
f.tight_layout(pad=3.0)
p[0].hist(main_noise['Latitude'],bins=50,alpha=0.75,edgecolor = 'white', linewidth = 1.2)
p[0].tick_params(labelsize=font)
p[0].set_title('Histogram and KDE of Latitude',fontsize=font)
# KDE
density = gaussian_kde(main_noise['Latitude'])
m,n=np.histogram(main_noise['Latitude'],bins=50)
p[0].set_ylabel('Number of cases')
p[1].plot(n,density(n))
#p[1].tick_params(labelsize=font)
p[1].set_xlabel('Latitudes')
p[1].set_ylabel('Density')
plt.show()
f,p=plt.subplots(2,1,figsize=(8,8))
f.tight_layout(pad=3.0)
p[0].hist(main_noise['Longitude'],bins=50,alpha=0.75,edgecolor = 'white', linewidth = 1.2)
p[0].tick_params(labelsize=font)
p[0].set_title('Histogram and KDE of Longitude',fontsize=font)
# KDE
density = gaussian_kde(main_noise['Longitude'])
m,n=np.histogram(main_noise['Longitude'],bins=50)
p[0].set_ylabel('Number of cases')
p[1].plot(n,density(n))
#p[1].tick_params(labelsize=font)
p[1].set_xlabel('Longitude')
p[1].set_ylabel('Density')
plt.show()

Based on the histograms, we observed how the coordinates are distributed; the pattern fits the territorial shape of New York City.

Relational analysis with other types

These plots analyze the relationships between the 5 noise types and the other 49 highest-ranked complaint types (245 pairings altogether).

Each point in each plot is the pair of monthly totals of two complaint types. For example, in the plot of Illegal Parking and Noise - Residential, a point at (7807, 2559) means that in August 2011 there were 7807 cases of Noise - Residential and 2559 cases of Illegal Parking.

We try to find plots with a strong or interesting relation between the five noise types and the other complaint types.
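A single such pairing can be sketched on synthetic monthly counts: align the two types month by month, then fit a line and read off the slope and correlation. The column names and numbers below are placeholders, not our preprocessed file:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
months = np.arange(1, 121)  # 120 months, Jan 2010 .. Dec 2019

# Synthetic monthly counts for two complaint types: the second is a noisy
# linear function of the first, so a positive slope should be recovered.
noise_res = 2000 + 30 * months + rng.normal(0, 100, months.size)
illegal_parking = 0.4 * noise_res + rng.normal(0, 80, months.size)

pair = pd.DataFrame({'month': months,
                     'Noise - Residential': noise_res,
                     'Illegal Parking': illegal_parking})

fit = stats.linregress(pair['Noise - Residential'], pair['Illegal Parking'])
print(round(fit.slope, 2), round(fit.rvalue, 2))  # slope near 0.4, R near 1
```

A slope near the planted 0.4 with a high R value is what a genuinely correlated pair looks like in the grids below; uncorrelated pairs show flat lines and low R.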

from scipy import stats
df = pd.read_csv('311-ac-monthGroupList.csv')

filename2 = '311-All-Cmplaint Type-Groupby.csv'

dfNoise = ['Noise - Residential' , 'Noise - Street/Sidewalk' ,'Noise - Commercial' ,'Noise - Vehicle','Noise']
nr = ['Illegal Parking', 'Blocked Driveway', 'Noise', 'Sewer', 'PAINT - PLASTER', 'Noise - Commercial']
ns = ['Noise - Residential', 'Illegal Parking', 'Blocked Driveway', 'HEATING', 'Traffic Signal Condition', 'Damaged Tree', 'Rodent', 'Consumer Complaint', 'New Tree Request', 'Overgrown Tree/Branches', 'Maintenance or Facility', 'Elevator', 'Root/Sewer/Sidewalk Condition']
nc = ['Noise - Residential', 'Illegal Parking', 'Blocked Driveway', 'PLUMBING', 'Water System', 'GENERAL CONSTRUCTION', 'Noise', 'Noise - Street/Sidewalk', 'Taxi Complaint']
nv = ['Noise - Residential', 'HEAT/HOT WATER', 'Street Light Condition', 'HEATING', 'Noise - Street/Sidewalk', 'UNSANITARY CONDITION', 'Traffic Signal Condition', 'Sewer', 'Dirty Conditions', 'Sanitation Condition', 'Rodent', 'Building/Use', 'Derelict Vehicles', 'Consumer Complaint', 'Graffiti', 'New Tree Request', 'Overgrown Tree/Branches', 'Maintenance or Facility', 'Elevator', 'Root/Sewer/Sidewalk Condition', 'Food Establishment']
no = ['Noise - Residential', 'HEATING', 'Noise - Street/Sidewalk', 'Traffic Signal Condition', 'Dirty Conditions', 'New Tree Request', 'Overgrown Tree/Branches', 'Root/Sewer/Sidewalk Condition', 'Illegal Parking', 'Blocked Driveway', 'PLUMBING', 'General Construction/Plumbing', 'Noise - Commercial', 'Broken Muni Meter', 'Taxi Complaint']

dfT50 = pd.read_csv(filename2).head(n=50)
# plot all plot with five noise and 49 others
for i in dfNoise:
    f1,p1 = plt.subplots(7, 7, figsize=(40,40))
    k=0
    n=0
    ci = 0 
    
    for j in dfT50['Complaint Type']:
        temp = df[df["Complaint Type"]==i]
#         print(i)
        if i==j:
            continue
        temp2 = df[df["Complaint Type"]==j]
        len1 = len(temp)
        len2 = len(temp2)
#        print(str(len1)+i+'-'+str(len2)+j)
        if len1>len2:
            temp = temp.head(n = len2)
        else:
            temp2 = temp2.head(n = len1)
        slope, intercept, r_value, p_value, std_err = stats.linregress(temp['monthSize'],temp2['monthSize'])
        p1[n][k].scatter(temp['monthSize'],temp2['monthSize'],marker='o')
        p1[n][k].plot(temp['monthSize'], intercept + slope*temp['monthSize'], 'r',label='slope: '+str(round(slope, 2)))
        p1[n][k].set_xlabel(i)
        p1[n][k].set_ylabel(j)
        p1[n][k].legend(loc="upper left")
        if(k == 6):
            n=n+1
            k=0
        else:
            k=k+1
    name = i.replace('/', '-', 1)
#     f1.savefig('img/connection-'+name+'.png')

Scatter plot analysis

The following plots were selected because we considered that they deserve further exploration.

select = [nr,ns,nc,nv,no]
total = len(nr)+len(ns)+len(nc)+len(nv)+len(no)
# select all necessary plot
f1,p1 = plt.subplots(8, 8, figsize=(40,40))
k=0
n=0
big = 0
small = 0 
s1 = []
s2 = []
for kk in range(5):
    i = dfNoise[kk]
    cat =select[kk]
# for i in dfNoise:
    for j in cat:
        temp = df[df["Complaint Type"]==i]
        temp2 = df[df["Complaint Type"]==j]
        len1 = len(temp)
        len2 = len(temp2)
        if len1>len2:
            temp = temp.head(n = len2)
        else:
            temp2 = temp2.head(n = len1)
        slope, intercept, r_value, p_value, std_err = stats.linregress(temp['monthSize'],temp2['monthSize'])
        if slope > 0:
            big = big+1
            s1.append([i,j,slope])
        elif slope <= 0:
            small = small+1
            s2.append([i,j,slope])
        p1[n][k].scatter(temp['monthSize'],temp2['monthSize'],marker='o')
        p1[n][k].plot(temp['monthSize'], intercept + slope*temp['monthSize'], 'r',\
                      label='slope ~ R: '+str(round(slope, 2))+' ~ '+str(round(r_value, 2)))
        p1[n][k].set_xlabel(i)
        p1[n][k].set_ylabel(j)
        p1[n][k].legend(loc="upper left")
        if(k == 7):
            n=n+1
            k=0
        else:
            k=k+1
            
# name = i.replace('/', '-', 1)
#     f1.savefig('img/connection-'+name+'.png')
print("slope>0:"+str(big))
print("slope<=0:"+str(small))
slope>0:47
slope<=0:17

Analyzing the plots above shows a positive correlation between Illegal Parking and 'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Commercial' and 'Noise (others)'.

Is this a coincidence? Or is there indeed a connection between them in the real world? We use folium to draw a map to explore whether there is a spatial relationship.

car_noise = ['Noise - Residential']
park = 'Illegal Parking'

f1,p1 = plt.subplots(figsize=(10,5))
f1.suptitle('The relationship between Illegal Parking and Noise - Residential during 2010-2020')

big = 0
small = 0 
for kk in car_noise:
    
    temp = df[df["Complaint Type"]==kk]
    
    temp2 = df[df["Complaint Type"]==park]
    
    len1 = len(temp)
    len2 = len(temp2)
    if len1>len2:
        temp = temp.head(n = len2)
    else:
        temp2 = temp2.head(n = len1)
            
    slope, intercept, r_value, p_value, std_err = stats.linregress(temp['monthSize'],temp2['monthSize'])
    p1.scatter(temp['monthSize'],temp2['monthSize'],marker='o')
    p1.plot(temp['monthSize'], intercept + slope*temp['monthSize'], 'r',\
                  label='slope ~ R: '+str(round(slope, 2))+' ~ '+str(round(r_value, 2)))
    p1.set_xlabel(kk)
    p1.set_ylabel(park)
    p1.legend(loc="upper left")

Map structure introduction

Here, we present Illegal Parking and Noise - Residential to examine their relationship. In the folium map, we drew 500 random GPS sample points from each category during the years 2010-2020. The background heat map displays the Illegal Parking distribution, while the points are sampled from Noise - Residential.

Illegal Parking and Noise - Residential

In this map, the areas where noise complaints concentrate cluster near large numbers of illegal parking reports. This is consistent with certain practical conditions: the noisy areas are likely densely populated, and parking spaces there are scarce.

However, this pattern is not very clear, and it cannot by itself directly explain the relationship between Illegal Parking and Noise - Residential.

import folium
from folium.plugins import MarkerCluster
from folium.plugins import HeatMap
import numpy as np
import pandas as pd

df_whole = pd.read_csv('311-GPS-noise-parking.csv')
# the relationship between Illegal Parking and Noise - Residential
# Map show
borough_map='Borough Boundaries.geojson'
# Do the plotting with sampling: Scatter plot
map_hooray_scatter=folium.Map(location=[40.7128, -74.0060],tiles = "Stamen Toner",zoom_start=10.5)
# drop records with missing coordinates before sampling, so we get exactly 500 valid points
selected=df_whole[df_whole['Complaint Type']=='Illegal Parking']
selected=selected[~np.isnan(selected['Latitude'])].sample(500)
cmlist=selected[['Latitude','Longitude']].values.tolist()

selected_NR=df_whole[df_whole['Complaint Type']=='Noise - Residential']
selected_NR=selected_NR[~np.isnan(selected_NR['Latitude'])].sample(500)
cmlist_NR=selected_NR[['Latitude','Longitude']].values.tolist()

folium.GeoJson(borough_map).add_to(map_hooray_scatter)

# draw one circle marker per Noise - Residential sample point
for i in cmlist_NR:
    folium.CircleMarker(i, radius=3.5, weight=2.0, color='#0000CC', fill_color='#0066FF', opacity=0.75, fill_opacity=0.5).add_to(map_hooray_scatter)
HeatMap(cmlist,max_zoom=1000000,radius=20).add_to(map_hooray_scatter)
map_hooray_scatter

If relevant, talk about your machine learning.

For this project, the focus is on statistical analysis, visualization and storytelling. No machine-learning problems are involved in the analysis, except that we planned to use K-nearest neighbours to correct the default or invalid values in the attribute 'Incident Zip'. As described in the data-cleaning section, it turned out to be impossible to implement KNN, because in most of the affected records both the coordinates and the zip code are missing at the same time.
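Had valid coordinates been present in the affected rows, the intended correction could have been sketched roughly as follows: find the k records with known zip codes that are geographically closest, and take a majority vote. This is only a minimal illustration on hypothetical sample points, not the project's actual pipeline:

```python
from collections import Counter
import math

def knn_impute_zip(known, query_point, k=3):
    """Guess a zip code for query_point (lat, lon) from the k nearest
    records whose zip code is known, by majority vote."""
    ranked = sorted(known, key=lambda rec: math.dist(rec[0], query_point))
    votes = Counter(zip_code for _, zip_code in ranked[:k])
    return votes.most_common(1)[0][0]

# hypothetical labelled records: ((latitude, longitude), zip code)
known = [
    ((40.71, -74.00), '10007'),
    ((40.72, -74.01), '10007'),
    ((40.80, -73.95), '10027'),
    ((40.81, -73.94), '10027'),
]

print(knn_impute_zip(known, (40.715, -74.005)))  # downtown neighbours dominate -> '10007'
```

In practice a library implementation (e.g. scikit-learn's `KNeighborsClassifier`) would be used on the full coordinate set, but the voting idea is the same.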

4 Genre

Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segal & Heer). Why?

For visual narrative, we chose the interactive slideshow, which we thought would strike a good balance between author-driven and reader-driven stories. There is an overall narrative structure (the slideshow), yet at certain points the user can manipulate the interactive visualizations (the interactive map and interactive bar chart in this project) to see more detailed information, so that the reader can better understand a pattern or extract more relevant information. Readers can also control the reading progression themselves. For highlighting, we provide zooming, so readers can further explore the details that arouse their interest.

Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segal & Heer) Why?

Linear ordering was selected in order to form a complete story line, while hover details and selection are used in the interactive parts. We maintain that these increase the reader's sense of participation and interactivity. In the messaging section, headlines, annotations, an introduction and a summary are used. The headlines guide the readers to the specific content of the article, while the annotations help readers obtain more descriptive information. The introduction arouses readers' interest and attracts them to further reading, while the summary concludes the content and stimulates readers' thinking; together they give readers a complete picture of the whole story.

5 Visualization

Explain the visualizations you've chosen.

  • Interactive choropleth map for the distribution of noise cases across different blocks

It is an interactive choropleth map which shows not only the overall distribution of the reported cases but also detailed information for each block.

The color of a block indicates how many reported noise cases per hectare it contains, and readers can easily get a good understanding of the overall distribution with reference to the color bar.

Besides, when you put your mouse on a marker and click it, you will get the zip code, block name and the number of cases per hectare.
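The per-block density behind the color scale boils down to dividing the case count for each zip code by that block's area. A minimal sketch with hypothetical counts and areas (the real values come from the 311 records and the borough GeoJSON):

```python
from collections import Counter

# hypothetical 311 records: one zip code per reported noise case
records = ['10007'] * 1200 + ['10027'] * 800

# hypothetical block areas in hectares (really derived from the GeoJSON geometry)
area_ha = {'10007': 150.0, '10027': 400.0}

cases = Counter(records)
density = {z: cases[z] / area_ha[z] for z in cases}
print(density)  # {'10007': 8.0, '10027': 2.0}
```

The resulting density dictionary is what a choropleth layer would bind to the block polygons.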

  • Percentage of various noise types in different boroughs

It is an interactive bar chart that shows the distribution of the top ten noise subtypes in the five boroughs of New York.

We sorted out the top 10 noise subtypes by frequency and calculated the percentage for each borough. The x axis presents the 10 noise types, while the y axis illustrates the percentage for each borough. When the mouse is moved onto a bar, it shows the accurate percentage value.
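The percentages behind the bar chart amount to a simple per-borough normalisation of subtype counts, sketched here on hypothetical counts rather than the actual 311 data:

```python
from collections import Counter

# hypothetical (borough, noise subtype) pairs extracted from the complaints
complaints = (
    [('Manhattan', 'Loud Music/Party')] * 60
    + [('Manhattan', 'Construction')] * 40
    + [('Brooklyn', 'Loud Music/Party')] * 30
    + [('Brooklyn', 'Construction')] * 70
)

counts = Counter(complaints)
borough_totals = Counter(b for b, _ in complaints)

# percentage of each subtype within its own borough
pct = {(b, t): 100 * n / borough_totals[b] for (b, t), n in counts.items()}
print(pct[('Manhattan', 'Loud Music/Party')])  # 60.0
```

On the real data this is one pandas `groupby` plus a division by borough totals; the plain-Python version just makes the arithmetic explicit.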

Why are they right for the story you want to tell?

From the interactive choropleth map and bar chart, readers can get not only a general understanding of the problem but also the detailed information that interests them. Also, we provide our own story line to tell the readers what we have found and want them to know, and we use the necessary supplementary material (statistics and images) to help readers understand better. These story lines originate from the phenomena presented in the interactive visualizations. Therefore, we think they are the right tools for the report.

6 Discussion

What went well?

  • There were invalid values in the dataset, but they constituted quite a small proportion (less than 5%) of the data we were concerned with, so discarding them did not influence the results.
  • All the code worked well, and the results fit both our general understanding of the problem and the relevant information we obtained from the Internet.
  • We also found the right visualization tools to present our ideas.

What could be improved? Why?

  • The interactive choropleth map is divided into blocks by zip code, and we observed that block size varies a lot. All the data were aggregated into these blocks, so the distribution inside a large block cannot be observed. We noticed this when zooming in on Manhattan: we found some small blocks of high density and realized that the uneven distribution within the large blocks had been hidden. A heat map could solve this problem, but it is not interactive and therefore cannot provide the detailed information we wanted to present to the readers.
  • Our analysis was combined with information from other sources that we thought were related to the phenomenon. This explained some things, but in some cases we did not dig deeper. We believe further exploration of some problems is worthwhile, but more information and more advanced mathematical tools are required.
  • There may be other interesting aspects of the data that deserve to be explored. The heat/hot-water problem is the second most frequently reported category, which may also contain interesting insights. Also, the relationships between different noise types are worth exploring. We plotted the relationships between different types of complaints and found a lot of interesting points, but we thought the report would then contain too much information, so we kept some of this preliminary study in the explainer notebook.

7 Contribution

  • Qing Zheng(s192159) has been the main responsible for the interactive bar chart and analysis, genre control in the explainer notebook, as well as video directing and producing.
  • Zhijie Song(s200093) has been the main responsible for the interactive folium map and analysis in the report, data cleaning/preprocessing and statistical analysis for both report and video.
  • Shuai Wang(s200108) has been the main responsible for the statistical visualization in the report, and part of data cleaning/preprocessing and statistical analysis for both report and video.

This project was completed on May 15, 2020 by DTU master students as a final project in the course 02806 Social Data Analysis and Visualization, organized by DTU Professor Sune Lehmann Jørgensen.